-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MARC Author name importing #9797
base: master
Are you sure you want to change the base?
Conversation
77b4d4c
to
4f18edd
Compare
3171df3
to
3d44aab
Compare
9fc7c52
to
f5dc6c5
Compare
openlibrary/catalog/marc/tests/test_data/bin_expect/uoft_4351105_1626.json
Outdated
Show resolved
Hide resolved
ba029fb
to
08c92bd
Compare
mypy is wasting my time on this. it's ridiculous that these tests can't run locally in Docker, which is how everything else in the project runs, mypy is available in the container via requirements_test.txt , but it's not functional. It's presence in the container seems to be an accident. This project uses Docker to take advantage of a controlled dev and test environment, I do not use venv for this project, because there is no need to. I'm not even a great fan of mypy type checking, it's a bit of a disaster on this project which still has tons of legacy code which has barely been upgraded from Python 2. It's forcing to put bandaids on unrelated code that functions, when much of it needs a deep refactor. It's frustrating to have to type hint everything when you've added a few hints to code you understand and are actively changin. I have backed out type hints on some methods before because it opened a can of worms of errors I just couldn't untangle all the way back to source. There seems to be no integration between the Docker env, the git pre-commit env, and the local test env (which apparently is something that is required to be outside of Docker), it's not documented how to run the full test suite locally, and I think having a separate test/dev environment to run half the tests is a mistake, even if it were documented. |
mypy's error messages aren't even very good. The last error is:
It seems to be incorrectly inferring
|
6c1813b
to
55c8002
Compare
idk, now these tests are failing non-deterministically because of 500s from some external Vercel web app I've never heard of because ... javascript bundle size. What kind of testing hell is this? How many more gates? |
mypy test already run as part of the pre-commit hooks, and the output is easier to read there. It just adds duplicate noise to the tests-py output
I have rebased against the main branch and that has fixed the javascript tests. I have also squashed the pre-commit CI test noise changes, and now all tests are passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick pass through and it's definitely an improvement. The two issues that I see are:
- figuring out which 7xx's are authors and which aren't
- normalizing contributor roles / relators
I don't claim to have magic solutions for either, so just highlighting the issues.
"name": "Homer", | ||
"entity_type": "person" | ||
}, | ||
{ | ||
"name": "Buckley, Theodore William Aldis", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The by_statement indicates that this is a translator, who probably shouldn't be promoted to author, but I'm not sure how this case can be detected with at natural language processing of the by statement.
"birth_date": "1849", | ||
"name": "Cowles, Calvin D.", | ||
"entity_type": "person", | ||
"role": "comp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Standard MARC relator is "com". Is this an OpenLibrary specific code or are these not being normalized?
"birth_date": "1787", | ||
"death_date": "1855", | ||
"entity_type": "person", | ||
"role": "tr. [and] ed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these to be translatable, they're going to need to be standardized/normalized. MARC relators are "trl" and "edt"
"title": "gaon", | ||
"personal_name": "Yehudai ben Na\u1e25man", | ||
"personal_name": "Yehudai ben Naḥman", | ||
"role": "supposed author", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the standard relator for this might be "att" for "Attributed name"
"name": "Schlosberg, Leon", | ||
"date": "d. 1899", | ||
"entity_type": "person", | ||
"role": "ed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edt
"birth_date": "1849", | ||
"name": "Cowles, Calvin D.", | ||
"entity_type": "person", | ||
"role": "comp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
com
"birth_date": "1787", | ||
"death_date": "1855", | ||
"entity_type": "person", | ||
"role": "tr. [and] ed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trl
edt
@tfmorris good comments on the various role abbreviations. I was trying to get this merged to get the test files mostly stable before addressing the author role expansion. There was existing functionality that took The I think most of your comments are covered by the new PR, except maybe "att" for "Attributed name" -- I'll look into that further. Again, I don't really understand why there are two almost identical realtor / relationship subfields. The My plan was to:
If this PR is merged, the rebased version of #9901 should make these changes clear, and we can discuss role expansion in detail there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hornc, everything looks good to me, save for the removal of mypy
running as part of the GitHub workflow. I will ask about that today as I don't think that's a unilateral decision I can make, though I will note that it has caused you a lot of trouble here.
We just chatted about this and:
@scottbarnes to expand :P |
@@ -48,8 +48,6 @@ jobs: | |||
run: | | |||
git fetch --no-tags --prune --depth=1 origin master | |||
make test-py | |||
source scripts/run_doctests.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hornc, I brought this up at the weekly ABC call and as Drini mentioned removing mypy
is okay here, but we'd like to keep running the doctests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scottbarnes the majority of the tests run in source scripts/run_doctests.sh
are already running in make test-py
Running them twice does not add value, it simply takes longer and gives a false impression of 'doing more testing'. It also makes it harder to debug and locate when a test fails if it is failing in two locations.
In general, I don't think the OL project's doctests are maintained or current. The majority of the tests run in /run_doctests.sh
are not doctests anyway.
The correct place to put OL Python tests are in the standard test files to be run via make test-py
I believe that this has already been done, but it's hard to review since the bulk of what runs is simple duplication.
The ignore list in scripts/run_doctests.sh
seems totally arbitrary. This script should be deprecated, and if there is anything still of value in doctests, they should be highlighted and moved to more appropriate locations, so they can be maintained properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for pointing out that the they're already running from make test-py
. I will try to look at this more closely tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds more tests and improves string handling on MARC imports.
Fixes Some org author names are being incorrectly rearranged around commas on import #9789 Some org author names are being incorrectly rearranged around commas on import -- fixes the alternate script issues around 7XX fields. The org splitting was already fixed in Sort out Author types on import #9601
Fixes MARC 100 vs 700 author / contributor inconsistency #7723
Removes redundant
personal_name
field if it is an exact duplicate ofname
Removes
contributions: str
field and usesauthor
, androle
(when it is specified in source)Uses original script for author names, and places transliterations under
alternate_names
Replaces Unicode \u00AB strings with UTF-8 in test data for ease of verification / understanding the expected results
It seems 7xx has no direct relation to a
contributor
2nd-class author. Most of the existing examples represent equal co-authors.Also, it looks like
contributor: str
is a deprecated field given internetarchive/openlibrary-client@7e45d51It's not yet clear to me how to identify a statement of responsibility as either an Author or Contributor. The current
Author
schema supports "roles", which matches source data better, e.g. 700$e " Relator term".Initial Questions, and this PR's answers:
contributons
? ... no one, deprecate this field.contributions: str
field isn't used by the UI(?), and should be replaced by the more informativecontributor
-- see internetarchive/openlibrary-client@7e45d51alternate_names
Technical
Testing
Screenshot
Stakeholders
@tfmorris